Active learning via a sample consistency assessment

ABSTRACT

A method includes obtaining a set of unlabeled training samples. For each training sample in the set of unlabeled training samples, the method includes generating, using a machine learning model and the training sample, a corresponding first prediction, generating, using the machine learning model and a modified unlabeled training sample, a second prediction, the modified unlabeled training sample based on the training sample, and determining a difference between the first prediction and the second prediction. The method includes selecting, based on the differences, a subset of the set of unlabeled training samples. For each training sample in the subset of the set of unlabeled training samples, the method includes obtaining a ground truth label for the training sample, and generating a corresponding labeled training sample based on the training sample paired with the ground truth label. The method includes training the machine learning model using the corresponding labeled training samples.

CROSS REFERENCE TO RELATED APPLICATIONS

This U.S. patent application is a continuation of, and claims priority under 35 U.S.C. § 120 from, U.S. patent application Ser. No. 17/000,094, filed on Aug. 21, 2020, which claims priority under 35 U.S.C. § 119(e) to U.S. Provisional Application 62/890,379, filed on Aug. 22, 2019. The disclosures of these prior applications are considered part of the disclosure of this application and are hereby incorporated by reference in their entireties.

TECHNICAL FIELD

This disclosure relates to active learning such as active learning using a sample consistency assessment.

BACKGROUND

Generally, supervised machine learning models require large amounts of labeled training data in order to accurately predict results. However, while obtaining large amounts of unlabeled data is often easy, labeling the data is frequently very difficult. That is, labeling vast quantities of data is often inordinately expensive if not outright impossible. Thus, active learning is a popular type of machine learning that allows for the prioritization of unlabeled data in order to train a model only on data that will have the highest impact (i.e., the greatest increase in accuracy). Typically, an active learning algorithm is first trained on a small subset of labeled data and then may actively query a teacher to label select unlabeled training samples. The process of selecting the unlabeled training samples is an active field of study.

SUMMARY

One aspect of the disclosure provides a method for active learning via a sample consistency assessment. The method includes obtaining, by data processing hardware, a set of unlabeled training samples. During each of a plurality of active learning cycles and for each unlabeled training sample in the set of unlabeled training samples, the method includes perturbing, by the data processing hardware, the unlabeled training sample to generate an augmented training sample. The method also includes generating, by the data processing hardware, using a machine learning model configured to receive the unlabeled training sample and the augmented training sample as inputs, a predicted label for the unlabeled training sample and a predicted label for the augmented training sample and determining, by the data processing hardware, an inconsistency value for the unlabeled training sample. The inconsistency value represents variance between the predicted label for the unlabeled training sample and the predicted label for the augmented training sample. The method also includes sorting, by the data processing hardware, the unlabeled training samples in the set of unlabeled training samples in descending order based on the inconsistency values and obtaining, by the data processing hardware, for each unlabeled training sample in a threshold number of unlabeled training samples selected from the sorted unlabeled training samples in the set of unlabeled training samples, a ground truth label. The method includes selecting, by the data processing hardware, a current set of labeled training samples. The current set of labeled training samples includes each unlabeled training sample in the threshold number of unlabeled training samples selected from the sorted unlabeled training samples in the set of unlabeled training samples paired with the corresponding obtained ground truth label. The method also includes training, by the data processing hardware, using the current set of labeled training samples and a proper subset of unlabeled training samples from the set of unlabeled training samples, the machine learning model.

Implementations of the disclosure may include one or more of the following optional features. In some implementations, the threshold number of unlabeled training samples is less than a cardinality of the set of unlabeled training samples. The inconsistency value for each unlabeled training sample in the threshold number of unlabeled training samples may be greater than the inconsistency value for each unlabeled training sample not selected from the sorted unlabeled training samples in the set of unlabeled training samples.

Optionally, the method further includes obtaining, by the data processing hardware, the proper subset of unlabeled training samples from the set of unlabeled training samples by removing the threshold number of unlabeled training samples from the set of unlabeled training samples. The method may further include selecting, by the data processing hardware, a first M number of unlabeled training samples from the sorted unlabeled training samples in the set of unlabeled training samples as the threshold number of unlabeled training samples.

In some examples, the method further includes, during an initial active learning cycle, randomly selecting, by the data processing hardware, a random set of unlabeled training samples from the set of unlabeled training samples and obtaining, by the data processing hardware, corresponding ground truth labels for each unlabeled training sample in the random set of unlabeled training samples. The method may further include training, by the data processing hardware, using the random set of unlabeled training samples and the corresponding ground truth labels, the machine learning model. This example may include, during the initial active learning cycle, identifying, by the data processing hardware, a candidate set of unlabeled training samples from the set of unlabeled training samples. A cardinality of the candidate set of unlabeled training samples may be less than a cardinality of the set of unlabeled training samples. The method may further include determining, by the data processing hardware, a first cross entropy between a distribution of ground truth labels and a distribution of predicted labels generated using the machine learning model for the unlabeled training samples in the candidate set of unlabeled training samples and determining, by the data processing hardware, a second cross entropy between a distribution of ground truth labels and a distribution of predicted labels generated using the machine learning model for the unlabeled training samples in the set of unlabeled training samples. The method may further include determining, by the data processing hardware, whether the first cross entropy is greater than or equal to the second cross entropy and, when the first cross entropy is greater than or equal to the second cross entropy, selecting, by the data processing hardware, the candidate set of unlabeled training samples as a starting size for initially training the machine learning model. Identifying the candidate set of unlabeled training samples from the set of unlabeled training samples, in some implementations, includes determining the inconsistency value for each unlabeled training sample of the set of unlabeled training samples.
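The comparison of the two cross entropies can be illustrated with a minimal sketch. It assumes that each distribution is represented as a vector of class frequencies that sums to one, and it uses the standard definition of cross entropy, H(p, q) = -Σ p·log(q). The function names `cross_entropy` and `candidate_set_is_sufficient` and the smoothing constant `eps` are hypothetical and are not part of the disclosure.

```python
import numpy as np


def cross_entropy(true_dist, pred_dist, eps=1e-12):
    """Standard cross entropy H(p, q) = -sum(p * log(q)).

    true_dist: distribution of ground truth labels (assumed to sum to 1).
    pred_dist: distribution of predicted labels (assumed to sum to 1).
    eps: hypothetical smoothing constant that avoids log(0).
    """
    p = np.asarray(true_dist, dtype=float)
    q = np.asarray(pred_dist, dtype=float)
    return float(-np.sum(p * np.log(q + eps)))


def candidate_set_is_sufficient(first_ce, second_ce):
    """Return True when the first cross entropy (candidate set) is greater
    than or equal to the second cross entropy (full unlabeled set)."""
    return first_ce >= second_ce
```

Under these assumptions, a result of True corresponds to selecting the candidate set as the starting size, and a result of False corresponds to expanding the candidate set.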

In some implementations, the method may further include, when the first cross entropy is less than the second cross entropy, randomly selecting, by the data processing hardware, an expanded set of training samples from the unlabeled set of training samples and updating, by the data processing hardware, the candidate set of unlabeled training samples to include the expanded set of training samples randomly selected from the unlabeled set of training samples. The method may further include updating, by the data processing hardware, the unlabeled set of training samples by removing each training sample in the expanded set of training samples from the unlabeled set of training samples. During an immediately subsequent active learning cycle, the method may further include determining, by the data processing hardware, the first cross entropy between a distribution of ground truth labels and a distribution of predicted labels generated using the machine learning model for the training samples in the updated candidate set of unlabeled training samples and determining, by the data processing hardware, the second cross entropy between the distribution of ground truth labels and a distribution of predicted labels generated using the machine learning model for the training samples in the updated candidate set of unlabeled training samples. The method may further include determining, by the data processing hardware, whether the first cross entropy is greater than or equal to the second cross entropy. When the first cross entropy is greater than or equal to the second cross entropy, the method may further include selecting, by the data processing hardware, the updated candidate set of unlabeled training samples as a starting size for initially training the machine learning model. In some examples, the machine learning model includes a convolutional neural network.

Another aspect of the disclosure provides data processing hardware and memory hardware in communication with the data processing hardware. The memory hardware stores instructions that when executed on the data processing hardware cause the data processing hardware to perform operations. The operations include obtaining a set of unlabeled training samples. During each of a plurality of active learning cycles and for each unlabeled training sample in the set of unlabeled training samples, the operations include perturbing the unlabeled training sample to generate an augmented training sample. The operations also include generating, using a machine learning model configured to receive the unlabeled training sample and the augmented training sample as inputs, a predicted label for the unlabeled training sample and a predicted label for the augmented training sample and determining an inconsistency value for the unlabeled training sample. The inconsistency value represents variance between the predicted label for the unlabeled training sample and the predicted label for the augmented training sample. The operations also include sorting the unlabeled training samples in the set of unlabeled training samples in descending order based on the inconsistency values and obtaining, for each unlabeled training sample in a threshold number of unlabeled training samples selected from the sorted unlabeled training samples in the set of unlabeled training samples, a ground truth label. The operations include selecting a current set of labeled training samples. The current set of labeled training samples includes each unlabeled training sample in the threshold number of unlabeled training samples selected from the sorted unlabeled training samples in the set of unlabeled training samples paired with the corresponding obtained ground truth label. The operations also include training, using the current set of labeled training samples and a proper subset of unlabeled training samples from the set of unlabeled training samples, the machine learning model.

This aspect may include one or more of the following optional features. In some implementations, the threshold number of unlabeled training samples is less than a cardinality of the set of unlabeled training samples. The inconsistency value for each unlabeled training sample in the threshold number of unlabeled training samples may be greater than the inconsistency value for each unlabeled training sample not selected from the sorted unlabeled training samples in the set of unlabeled training samples.

Optionally, the operations further include obtaining the proper subset of unlabeled training samples from the set of unlabeled training samples by removing the threshold number of unlabeled training samples from the set of unlabeled training samples. The operations may further include selecting a first M number of unlabeled training samples from the sorted unlabeled training samples in the set of unlabeled training samples as the threshold number of unlabeled training samples.

In some examples, the operations further include, during an initial active learning cycle, randomly selecting a random set of unlabeled training samples from the set of unlabeled training samples and obtaining corresponding ground truth labels for each unlabeled training sample in the random set of unlabeled training samples. The operations may further include training, using the random set of unlabeled training samples and the corresponding ground truth labels, the machine learning model. This example may include, during the initial active learning cycle, identifying a candidate set of unlabeled training samples from the set of unlabeled training samples. A cardinality of the candidate set of unlabeled training samples may be less than a cardinality of the set of unlabeled training samples. The operations may further include determining a first cross entropy between a distribution of ground truth labels and a distribution of predicted labels generated using the machine learning model for the unlabeled training samples in the candidate set of unlabeled training samples and determining a second cross entropy between a distribution of ground truth labels and a distribution of predicted labels generated using the machine learning model for the unlabeled training samples in the set of unlabeled training samples. The operations may further include determining whether the first cross entropy is greater than or equal to the second cross entropy and, when the first cross entropy is greater than or equal to the second cross entropy, selecting the candidate set of unlabeled training samples as a starting size for initially training the machine learning model. Identifying the candidate set of unlabeled training samples from the set of unlabeled training samples, in some implementations, includes determining the inconsistency value for each unlabeled training sample of the set of unlabeled training samples.

In some implementations, the operations may further include, when the first cross entropy is less than the second cross entropy, randomly selecting an expanded set of training samples from the unlabeled set of training samples and updating the candidate set of unlabeled training samples to include the expanded set of training samples randomly selected from the unlabeled set of training samples. The operations may further include updating the unlabeled set of training samples by removing each training sample in the expanded set of training samples from the unlabeled set of training samples. During an immediately subsequent active learning cycle, the operations may further include determining the first cross entropy between a distribution of ground truth labels and a distribution of predicted labels generated using the machine learning model for the training samples in the updated candidate set of unlabeled training samples and determining the second cross entropy between the distribution of ground truth labels and a distribution of predicted labels generated using the machine learning model for the training samples in the updated candidate set of unlabeled training samples. The operations may further include determining whether the first cross entropy is greater than or equal to the second cross entropy. When the first cross entropy is greater than or equal to the second cross entropy, the operations may further include selecting the updated candidate set of unlabeled training samples as a starting size for initially training the machine learning model. In some examples, the machine learning model includes a convolutional neural network.

The details of one or more implementations of the disclosure are set forth in the accompanying drawings and the description below. Other aspects, features, and advantages will be apparent from the description and drawings, and from the claims.

DESCRIPTION OF DRAWINGS

FIG. 1 is a schematic view of an example system for training an active learning model.

FIG. 2 is a schematic view of example components of the system of FIG. 1.

FIGS. 3A-3C are schematic views of components for determining an initial starting size of labeled training samples.

FIG. 4 is a flowchart of an example arrangement of operations for a method of active learning via a sample consistency assessment.

FIG. 5 is a schematic view of an example computing device that may be used to implement the systems and methods described herein.

Like reference symbols in the various drawings indicate like elements.

DETAILED DESCRIPTION

As acquiring vast quantities of data becomes cheaper and easier, advances in machine learning are capitalizing by training models using deep learning methods on large amounts of data. However, this raises new challenges, as typically the data is unlabeled, which requires labeling prior to use with supervised learning or semi-supervised learning models. Conventionally, training data is labeled by human operators. For example, when preparing training samples for a model that performs object detection with frames of image data, an expert annotator (e.g., a trained human) may label frames of image data by drawing a bounding box around pedestrians. When the quantity of data is vast, manually labeling the data is expensive at best and impossible at worst.

One popular approach to the data labeling problem is active learning. In active learning, the model is allowed to proactively select a subset of training samples from a set of unlabeled training samples and request that the subset be labeled by an “oracle,” e.g., an expert annotator or any other entity that may accurately label the selected samples (i.e., provide the “ground truth” label). That is, active learning modules dynamically pose queries during training to actively select which samples to train on. Active learning has the potential to greatly reduce the overhead of labeling data while simultaneously increasing accuracy with substantially fewer labeled training samples.

In order to select samples that are useful to improve the target model, selection methods typically depend on outputs and/or intermediate features of the target model to measure unlabeled samples. For example, a method may use entropy of the output to measure uncertainty. Another method may ensure that selected samples cover a large range of diversity. Yet another method may use predicted loss to attempt to select the most valuable samples. However, all of these methods struggle to apply to convolutional neural networks (CNNs) when the labeling budget is small, because typically a large set of labeled data is needed for accurate CNN models.

Implementations herein are directed toward an active learning model trainer that trains a model (e.g., a CNN model) without introducing additional labeling cost. The trainer uses unlabeled data to improve the quality of the trained model while keeping the number of labeled samples small. The trainer is based upon the assumption that a model should be consistent in its decisions between a sample and a meaningfully distorted version of the same sample (i.e., a consistency of predictions).

Referring to FIG. 1, in some implementations, an example system 100 includes a processing system 10. The processing system 10 may be a single computer, multiple computers, or a distributed system (e.g., a cloud environment) having fixed or scalable/elastic computing resources 12 (e.g., data processing hardware) and/or storage resources 14 (e.g., memory hardware). The processing system 10 executes an active learning model trainer 110. The model trainer 110 trains a target model 130 (e.g., a machine learning model) to make predictions based on input data. For example, the model trainer 110 trains a convolutional neural network (CNN). The model trainer 110 trains the target model 130 on a set of unlabeled training samples 112, 112U. An unlabeled training sample refers to data that does not include any annotations or other indications of the correct result for the target model 130, which is in contrast to labeled data that does include such annotations. For example, labeled data for a target model 130 that is trained to transcribe audio data includes the audio data as well as a corresponding accurate transcription of the audio data. Unlabeled data for the same target model 130 would include the audio data without the transcription. With labeled data, the target model 130 may make a prediction based on a training sample and then easily compare the prediction to the label serving as a ground truth to determine how accurate the prediction was. In contrast, such feedback is not available with unlabeled data.

The unlabeled training samples 112U may be representative of whatever data the target model 130 requires to make its predictions. For example, the unlabeled training data may include frames of image data (e.g., for object detection or classification, etc.), frames of audio data (e.g., for transcription or speech recognition, etc.), and/or text (e.g., for natural language classification, etc.). The unlabeled training samples 112U may be stored on the processing system 10 (e.g., within memory hardware 14) or received, via a network or other communication channel, from another entity.

The model trainer 110 includes a sample perturber 120. The sample perturber 120 receives each unlabeled training sample 112U in the set of unlabeled training samples 112U and perturbs each unlabeled training sample 112U to generate a corresponding augmented training sample 112, 112A. That is, the sample perturber 120 introduces small, but meaningful changes to each unlabeled training sample 112U. For example, the sample perturber 120 increases or decreases values by a predetermined or random amount to generate a pair of training samples 112 that includes the original unlabeled training sample 112U and the corresponding augmented (i.e., perturbed) training sample 112A. As another example, when the unlabeled training sample 112U includes a frame of image data, the sample perturber 120 may rotate the image, flip the image, crop the image, etc. The sample perturber 120 may use any other conventional means of perturbing the data as well.
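As an illustration only, the two kinds of perturbation mentioned above might be sketched as follows. The sketch assumes that a sample is a NumPy array, that images are perturbed with a random horizontal flip and a random rotation by a multiple of 90 degrees, and that other values are perturbed with small Gaussian noise. The function names, the 0.5 flip probability, and the `scale` parameter are hypothetical choices and are not taken from the disclosure.

```python
import numpy as np


def perturb_image(image, rng):
    """Return a flipped and/or rotated copy of an H x W (x C) image array."""
    augmented = image
    # Flip left-to-right with probability 0.5 (hypothetical choice).
    if rng.random() < 0.5:
        augmented = np.fliplr(augmented)
    # Rotate by 0, 90, 180, or 270 degrees in the image plane.
    k = int(rng.integers(0, 4))
    augmented = np.rot90(augmented, k=k, axes=(0, 1))
    return augmented.copy()


def perturb_values(sample, rng, scale=0.05):
    """Increase or decrease each value by a small random amount."""
    return sample + rng.normal(0.0, scale, size=sample.shape)
```

Each unlabeled sample and its perturbed copy then form the pair of training samples that is passed to the target model.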

As discussed in more detail below, the target model 130 (i.e., the machine learning model the active learning model trainer 110 is training) is initially trained on a small set of labeled training samples 112, 112L and/or unlabeled training samples 112U. This quickly provides the target model 130 with rough initial prediction capabilities. This minimally-trained target model 130 receives, for each unlabeled training sample 112U, the unlabeled training sample 112U and the corresponding augmented training sample 112A. The target model 130, using the unlabeled training sample 112U, generates a predicted label 132, 132P_(U). The predicted label 132P_(U) represents the target model's prediction based on the unlabeled training sample 112U and the model's training to this point. The target model 130, using the augmented training sample 112A, generates another predicted label 132, 132P_(A). The predicted label 132P_(A) represents the target model's prediction based on the augmented training sample 112A and the model's training to this point. Note that the target model 130 typically is not configured to process both the unlabeled training sample 112U and the augmented training sample 112A simultaneously, and instead processes them sequentially (in either order) to first generate a first prediction label 132P with either one of the unlabeled training sample 112U or the augmented training sample 112A, and second generate a second prediction label 132P with the other one of the unlabeled training sample 112U or the augmented training sample 112A.

The active learning model trainer 110 includes an inconsistency determiner 140. The inconsistency determiner 140 receives both predictions 132P_(U), 132P_(A) for each pair of samples 112 for each unlabeled training sample 112U in the set of unlabeled training samples 112U. The inconsistency determiner 140 determines an inconsistency value 142 that represents variance between the predicted label 132P_(U) of the unlabeled training sample 112U and the predicted label 132P_(A) of the augmented training sample 112A. That is, a large inconsistency value 142 indicates that the unlabeled training sample 112U produces a large unsupervised loss when the target model 130 converges. Conversely, a small inconsistency value 142 indicates that the unlabeled training sample 112U produces a small unsupervised loss when the target model 130 converges. In some examples, the greater the difference between the predicted labels 132P_(U), 132P_(A), the greater the associated inconsistency value 142.
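One way to realize such an inconsistency value is sketched below. The sketch assumes that each predicted label is a vector of per-class scores (e.g., softmax probabilities) and that the inconsistency value is the per-class variance across the two predictions, summed over classes. This particular formula and the function name `inconsistency_value` are assumptions made for illustration; the text above only requires that the value represent variance between the two predicted labels.

```python
import numpy as np


def inconsistency_value(pred_original, pred_augmented):
    """Sum over classes of the variance across the two predictions.

    pred_original: per-class scores predicted for the unlabeled sample.
    pred_augmented: per-class scores predicted for the augmented sample.
    """
    preds = np.stack([pred_original, pred_augmented], axis=0)  # shape (2, classes)
    return float(np.var(preds, axis=0).sum())
```

With this choice, identical predictions yield a value of zero, and the value grows as the two predictions move apart.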

A sample selector 150 receives the inconsistency value 142 associated with each of the unlabeled training samples 112U. The sample selector 150 sorts the unlabeled training samples 112U in descending order based on the inconsistency values 142 and selects a current set of unlabeled training samples 112U_(T) from the sorted unlabeled training samples 112U. That is, the sample selector 150 selects a threshold number of unlabeled training samples 112U_(T) based on their respective inconsistency values 142 to form a current set of unlabeled training samples 112U_(T). The sample selector 150 obtains, for each unlabeled training sample 112U_(T), a ground truth label 132G. The ground truth labels 132G are labels that are empirically determined by another source. In some implementations, an oracle 160 determines the ground truth labels 132G of the unlabeled training samples 112U_(T). Optionally, the oracle 160 is a human annotator or other human agent.

The sample selector 150 may send the selected unlabeled training samples 112U_(T) to the oracle 160. The oracle 160, in response to receiving the unlabeled training samples 112U_(T), determines or otherwise obtains the associated ground truth label 132G for each unlabeled training sample 112U_(T). The unlabeled training samples 112U_(T), combined with the ground truth labels 132G, form labeled training samples 112L and may be stored with other labeled training samples 112L (e.g., the labeled training samples 112L that the model trainer 110 used to initially train the target model 130). That is, the model trainer 110 may select a current set of labeled training samples 112L that includes the selected unlabeled training samples 112U_(T) paired with the corresponding ground truth labels 132G.

The model trainer 110 trains (e.g., retrains or fine-tunes), using the current set of labeled training samples 112L (i.e., the selected unlabeled training samples 112U_(T) and the corresponding ground truth labels 132G), the target model 130. In some implementations, the model trainer 110 trains, using the current set of labeled training samples 112L and a proper subset of unlabeled training samples 112U_(P) from the set of unlabeled training samples 112U, the target model 130. The proper subset of unlabeled training samples 112U_(P) may include each unlabeled training sample 112U that was not part of any set of unlabeled training samples 112U_(T) (i.e., unlabeled training samples 112U selected to obtain the corresponding ground truth label 132G). Put another way, the model trainer 110 may obtain the proper subset of unlabeled training samples 112U_(P) from the set of unlabeled training samples 112U by removing the threshold number of unlabeled training samples 112U_(T) from the set of unlabeled training samples 112U.
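The sorting, selection, and removal steps can be sketched as follows, assuming the inconsistency values are held in a list indexed by sample position and that `m` stands for the threshold number of samples. The function name `select_for_labeling` is hypothetical.

```python
def select_for_labeling(inconsistency_values, m):
    """Split sample indices into those sent for labeling and the remainder.

    Indices are sorted in descending order of inconsistency value; the
    first m are selected, and the rest form the proper subset that stays
    unlabeled.
    """
    order = sorted(
        range(len(inconsistency_values)),
        key=lambda i: inconsistency_values[i],
        reverse=True,
    )
    selected = order[:m]
    remaining = sorted(order[m:])  # restore original order for convenience
    return selected, remaining
```

For example, with inconsistency values `[0.1, 0.9, 0.4, 0.7]` and `m = 2`, the samples at indices 1 and 3 are selected for labeling and the samples at indices 0 and 2 remain unlabeled.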

The model trainer 110 may also include in the training any previously labeled training samples 112L (i.e., from initial labels or from previous active learning cycles). Thus, the model trainer 110 may train the target model 130 on all labeled training samples 112L (i.e., the current set of labeled training samples 112L in addition to any previously labeled training samples 112L) and all remaining unlabeled training samples 112U (i.e., the set of unlabeled training samples 112U minus the selected unlabeled training samples 112U_(T)) via semi-supervised learning. That is, in some examples, the active learning model trainer 110 completely retrains the target model 130 using all of the unlabeled training samples 112U and labeled training samples 112L. In other examples, the active learning model trainer 110 incrementally retrains the target model 130 using only the newly obtained labeled training samples 112L. As used herein, training the target model 130 may refer to completely retraining the target model 130 from scratch or some form of retraining/fine-tuning the target model 130 by conducting additional training (with or without parameter changes such as by freezing weights of one or more layers, adjusting learning speed, etc.).

The model trainer 110 may repeat the process (i.e., perturbing unlabeled training samples 112U, determining inconsistency values 142, selecting unlabeled training samples 112U_(T), obtaining ground truth labels 132G, etc.) for any number of active learning cycles. For example, the active learning model trainer 110 repeats training of the target model 130 (and subsequently growing the set of labeled training samples 112L) for a predetermined number of cycles or until the target model 130 reaches a threshold effectiveness or until a labeling budget is satisfied. In this way, the model trainer 110 gradually increases the number of labeled training samples 112L until the number of samples is sufficient to train the target model 130.
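The repeated cycle can be summarized in a short sketch. It assumes the caller supplies four hypothetical callables: `predict` (returns per-class scores for a sample), `perturb` (returns an augmented sample), `oracle` (returns a ground truth label), and `fit` (trains the model on the labeled pairs and the remaining unlabeled samples). It also assumes the summed-variance inconsistency value and uses only a fixed number of cycles as the stopping rule; a threshold effectiveness or labeling budget check could be added in the same place.

```python
import numpy as np


def run_active_learning(samples, fit, predict, perturb, oracle, m, max_cycles):
    """Run up to max_cycles active learning cycles, labeling m samples per cycle."""
    unlabeled = list(samples)
    labeled = []  # list of (sample, ground truth label) pairs
    for _ in range(max_cycles):
        if not unlabeled:
            break
        # Score each unlabeled sample by the inconsistency of its two predictions.
        scores = []
        for x in unlabeled:
            preds = np.stack([predict(x), predict(perturb(x))], axis=0)
            scores.append(float(np.var(preds, axis=0).sum()))
        # Select the m most inconsistent samples (descending order).
        order = np.argsort(scores)[::-1]
        chosen = {int(i) for i in order[:m]}
        # Obtain ground truth labels and move the samples to the labeled set.
        labeled += [(unlabeled[i], oracle(unlabeled[i])) for i in sorted(chosen)]
        unlabeled = [x for i, x in enumerate(unlabeled) if i not in chosen]
        # Train on all labeled samples plus the remaining unlabeled samples.
        fit(labeled, unlabeled)
    return labeled, unlabeled
```

In this sketch the labeled set grows by at most `m` samples per cycle, which mirrors the gradual increase in labeled training samples described above.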

Referring now to FIG. 2, in some examples, the inconsistency value 142 for each unlabeled training sample 112U in the threshold number of unlabeled training samples 112U_(T) is greater than the inconsistency value 142 for each unlabeled training sample 112U not selected from the sorted unlabeled training samples 112U in the set of unlabeled training samples 112U. In this example, a schematic view 200 shows that the inconsistency determiner 140 sorts the inconsistency values 142, 142a-n from the most inconsistent value 142a (i.e., the highest inconsistency value 142) to the least inconsistent value 142n (i.e., the lowest inconsistency value 142). Each inconsistency value 142 has a corresponding unlabeled training sample 112U, 112Ua-n. Here, the most inconsistent value 142a corresponds to the unlabeled training sample 112Ua while the least inconsistent value 142n corresponds to the unlabeled training sample 112Un. In this example, the sample selector 150 selects the five unlabeled training samples 112U with the five most inconsistent values 142 as the current set of unlabeled training samples 112U_(T). It is understood that five is merely exemplary, and the sample selector 150 may select any number of unlabeled training samples 112U. Thus, the threshold number of unlabeled training samples 112U_(T) may be less than a cardinality of the set of unlabeled training samples 112U. In some implementations, the sample selector 150 selects a first M number (e.g., five, ten, fifty, etc.) of unlabeled training samples 112U from the sorted unlabeled training samples 112U in the set of unlabeled training samples 112U as the threshold number of training samples 112U_(T).

The selected unlabeled training samples 112U are passed to the oracle 160 to retrieve the corresponding ground truth labels 132G. Continuing with the illustrated example, the oracle 160 determines a corresponding ground truth label 132G for each of the five unlabeled training samples 112U_(T). The model trainer 110 may now use these five labeled training samples 112L (i.e., the five corresponding pairs of unlabeled training samples 112U and ground truth labels 132G) to train or retrain or fine-tune the target model 130.

Referring now to FIGS. 3A-C, in some examples, the model trainer 110 provides initial training of the untrained target model 130 during an initial active learning cycle (i.e., the first active learning cycle). As shown in schematic view 300a (FIG. 3A), in some implementations, an initial set selector 310 randomly selects a random set of unlabeled training samples 112U_(R) from the set of unlabeled training samples 112U. The initial set selector 310 also obtains corresponding ground-truth labels 132G_(R) for each unlabeled training sample 112U_(R) in the random set of unlabeled training samples 112U_(R). The model trainer 110 may train the machine learning model 130 using the random set of unlabeled training samples 112U_(R) and the corresponding ground-truth labels 132G_(R), which together form a set of labeled training samples 112L_(R). That is, in some implementations, prior to the target model 130 receiving any training, the model trainer 110 randomly selects a small set (relative to the entire set) of unlabeled training samples 112U_(R) and obtains the corresponding ground-truth labels 132G_(R) to provide initial training of the target model 130.
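As a non-limiting sketch, the random selection of FIG. 3A may be expressed in Python as shown below. The function name draw_random_labeled_set, the seed parameter, and the modeling of the label source as an oracle callable are hypothetical and are introduced only for this illustration.

```python
import random
from typing import Callable, List, Optional, Sequence, Tuple, TypeVar

S = TypeVar("S")  # type of an unlabeled training sample
L = TypeVar("L")  # type of a ground truth label


def draw_random_labeled_set(
    unlabeled: Sequence[S],
    size: int,
    oracle: Callable[[S], L],
    seed: Optional[int] = None,
) -> List[Tuple[S, L]]:
    """Randomly pick `size` samples without replacement and pair each with its label."""
    if size > len(unlabeled):
        raise ValueError("the random set cannot exceed the set of unlabeled samples")
    rng = random.Random(seed)  # a local generator keeps the draw reproducible
    random_set = rng.sample(list(unlabeled), size)
    return [(sample, oracle(sample)) for sample in random_set]
```

The returned pairs stand in for the set of labeled training samples 112L_(R) used for the initial training.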

Because the random set of unlabeled training samples 112U_(R) is both random and small, the training of the target model 130 is likely insufficient. To further refine a starting set of labeled training samples 112L to initially train the target model 130, the model trainer 110 may identify a candidate set of unlabeled training samples 112U_(C) from the set of unlabeled training samples 112U (e.g., fifty samples, one hundred samples, etc.). A cardinality of the candidate set of training samples 112U_(C) may be less than a cardinality of the set of unlabeled training samples 112U. For example, as shown in a schematic view 300b of FIG. 3B, the initial set selector 310 may receive inconsistency values 142 from the inconsistency determiner 140 based on predicted labels 132P_(U) from the target model 130 and select the candidate set of unlabeled training samples 112U_(C) based on the inconsistency values 142 of each unlabeled training sample 112U. That is, the model trainer 110 identifies the candidate set of unlabeled training samples 112U_(C) by determining the inconsistency value 142 for each unlabeled training sample 112U of the set of unlabeled training samples 112U. Optionally, the candidate set of unlabeled training samples 112U_(C) includes half of the unlabeled training samples 112U in the set of unlabeled training samples 112U with the highest corresponding inconsistency values 142.

After receiving corresponding ground truth labels 132G_(C), the initial set selector 310 may determine a first cross entropy 320 between a distribution of ground-truth labels 132G and a distribution of predicted labels 132P_(U) generated using the machine learning model 130 for the training samples in the candidate set of unlabeled training samples 112U_(C). The initial set selector 310 may also determine a second cross entropy 330 between the distribution of ground-truth labels 132G and a distribution of predicted labels 132P_(U) generated by the machine learning model 130 for the training samples in the set of unlabeled training samples 112U. That is, the first cross entropy 320 is between the actual label distribution for the candidate set 112U_(C) and the predicted label distribution for the candidate set 112U_(C), while the second cross entropy 330 is between the same actual label distribution for the candidate set 112U_(C) as the first cross entropy 320 and the predicted label distribution for the entire set of unlabeled training samples 112U. Cross entropy may be thought of generally as calculating the differences between two distributions.
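For illustration only, a cross entropy between two label distributions may be computed as H(p, q) = -Σ p_i log q_i, where p is the distribution of ground truth labels and q is a distribution of predicted labels. The following Python sketch assumes that each distribution is a normalized histogram over a fixed list of classes and that zero predicted probabilities are clamped to a small eps; the function names and the eps parameter are assumptions of this sketch rather than details of the disclosure.

```python
import math
from collections import Counter
from typing import Hashable, List, Sequence


def label_distribution(labels: Sequence[Hashable], classes: Sequence[Hashable]) -> List[float]:
    """Normalized histogram of `labels` over `classes` (assumes labels is non-empty)."""
    counts = Counter(labels)
    total = len(labels)
    return [counts[c] / total for c in classes]


def cross_entropy(p: Sequence[float], q: Sequence[float], eps: float = 1e-12) -> float:
    """Cross entropy H(p, q) = -sum_i p_i * log(q_i), in nats."""
    # Terms with p_i == 0 contribute nothing; q_i is clamped to avoid log(0).
    return -sum(pi * math.log(max(qi, eps)) for pi, qi in zip(p, q) if pi > 0)
```

Under these assumptions, the first cross entropy 320 would be cross_entropy(actual_candidate, predicted_candidate) and the second cross entropy 330 would be cross_entropy(actual_candidate, predicted_full_set), with the same actual distribution used in both calls.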

Referring now to FIG. 3C and decision tree 300c, in some implementations, the initial set selector 310 determines whether the first cross entropy 320 is greater than or equal to the second cross entropy 330 at step 350. In this scenario, the difference between the actual label distribution and the predicted label distribution for the candidate set 112U_(C) is greater than or equal to the difference between the actual label distribution and the predicted label distribution for the entire set of unlabeled training samples 112U. When the candidate set 112U_(C) is selected at least in part based on the largest inconsistency values 142 (i.e., the model trainer 110 determines the inconsistency value 142 for each unlabeled training sample 112U of the set of unlabeled training samples 112U), the model trainer 110 is selecting unlabeled training samples 112U that the model 130 is most uncertain about (i.e., samples 112U that tend to be far away from the data distribution), and this scenario thus indicates better performance.

Because of this indication, when the first cross entropy 320 is greater than or equal to the second cross entropy 330, at step 360, the initial set selector 310 may select the candidate set of unlabeled training samples 112U_(C) as a starting size of the current set of labeled training samples 112L. With the target model 130 initially trained, the model trainer 110 may proceed with subsequent active learning cycles as described above (FIGS. 1 and 2).

When the first cross entropy 320 is less than the second cross entropy 330 (i.e., an indication of poor target model 130 performance), the current candidate set 112U_(C) is inadequate for initial training of the target model 130. In this example, the initial set selector 310, at step 370, randomly selects an expanded set of training samples 112U_(E) from the unlabeled set of training samples 112U. At step 380, the initial set selector 310 updates the candidate set of unlabeled training samples 112U_(C) to include the expanded set of training samples 112U_(E) randomly selected from the unlabeled set of training samples 112U. In some examples, the initial set selector 310 updates the unlabeled set of training samples 112U by removing each training sample in the expanded set of training samples 112U_(E) from the unlabeled set of training samples 112U. This ensures that unlabeled training samples 112U are not duplicated.

During an immediately subsequent active learning cycle (i.e., the next active learning cycle), at step 390, the initial set selector 310 may repeat each of the previous steps with the updated candidate set 112U_(C). For example, the initial set selector 310 determines the first cross entropy 320 between the distribution of ground truth labels 132G and the distribution of predicted labels 132P generated using the machine learning model 130 for the training samples in the updated candidate set of unlabeled training samples 112U_(C). The initial set selector 310 also determines the second cross entropy 330 between the distribution of ground truth labels 132G and the distribution of predicted labels 132P generated using the machine learning model 130 for the training samples in the updated candidate set of unlabeled training samples 112U_(C). The initial set selector 310 again determines whether the first cross entropy 320 is greater than or equal to the second cross entropy 330. When the first cross entropy 320 is greater than or equal to the second cross entropy 330, the initial set selector 310 selects the updated candidate set of unlabeled training samples 112U_(C) as a starting size for initially training the machine learning model 130. When the first cross entropy 320 is less than the second cross entropy 330, the initial set selector 310 may continue to iteratively expand the candidate set 112U_(C) until the first cross entropy 320 is greater than or equal to the second cross entropy 330 (i.e., indicating that the target model 130 performance is sufficient).
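The iterative expansion of FIG. 3C may be sketched, under stated assumptions, as the Python loop below. The two cross entropies are supplied as callables (first_ce and second_ce) so that the sketch stays independent of any particular model; these callables, the expand_by and max_rounds parameters, and the function name choose_starting_set are hypothetical. Any retraining of the model between cycles is assumed to happen inside the callables.

```python
import random
from typing import Callable, List, Optional, Sequence, Tuple, TypeVar

S = TypeVar("S")  # type of an unlabeled training sample


def choose_starting_set(
    candidate: Sequence[S],
    pool: Sequence[S],
    first_ce: Callable[[List[S]], float],
    second_ce: Callable[[List[S], List[S]], float],
    expand_by: int,
    seed: Optional[int] = None,
    max_rounds: int = 100,
) -> Tuple[List[S], List[S]]:
    """Expand the candidate set until first_ce >= second_ce (or the pool runs out)."""
    rng = random.Random(seed)
    candidate, pool = list(candidate), list(pool)
    for _ in range(max_rounds):
        # Step 350: compare the first and second cross entropies.
        if first_ce(candidate) >= second_ce(candidate, pool):
            return candidate, pool  # step 360: accept the candidate set
        if not pool:
            break  # nothing left to add
        # Step 370: randomly select an expanded set from the unlabeled pool.
        k = min(expand_by, len(pool))
        picked = set(rng.sample(range(len(pool)), k))
        # Step 380: add the expanded set to the candidate set and remove it
        # from the pool so that no sample is duplicated.
        candidate.extend(s for i, s in enumerate(pool) if i in picked)
        pool = [s for i, s in enumerate(pool) if i not in picked]
    return candidate, pool
```

The max_rounds guard is a design choice of this sketch only; it bounds the loop when the stopping condition is never met.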

FIG. 4 is a flowchart of an exemplary arrangement of operations for a method 400 for active learning via a sample consistency assessment. The method 400, at step 402, includes obtaining, by data processing hardware 12, a set of unlabeled training samples 112U. During each of a plurality of active learning cycles, for each unlabeled training sample 112U in the set of unlabeled training samples 112U, the method 400, at step 404, includes perturbing, by the data processing hardware 12, the unlabeled training sample 112U to generate an augmented training sample 112A. At step 406, the method 400 includes generating, by the data processing hardware 12, using a machine learning model 130 configured to receive the unlabeled training sample 112U and the augmented training sample 112A as inputs, a predicted label 132P_(U) for the unlabeled training sample 112U and a predicted label 132P_(A) for the augmented training sample 112A.

At step 408, the method 400 includes determining, by the data processing hardware 12, an inconsistency value 142 for the unlabeled training sample 112U. The inconsistency value 142 represents variance between the predicted label 132P_(U) for the unlabeled training sample 112U and the predicted label 132P_(A) for the augmented training sample 112A. The method 400, at step 410, includes sorting, by the data processing hardware 12, the unlabeled training samples 112U in the set of unlabeled training samples 112U in descending order based on the inconsistency values 142.
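Steps 404 through 410 may be illustrated by the following Python sketch. It assumes, for illustration only, that each predicted label is a vector of class probabilities and that the inconsistency value is the per-class population variance across the two predictions, summed over classes; the disclosure states only that the value represents variance between the two predicted labels. The names toy_model and toy_perturb denote a hypothetical two-class model and perturbation used solely to make the sketch runnable.

```python
import math
from statistics import pvariance
from typing import Callable, List, Sequence, Tuple


def inconsistency_value(pred_original: Sequence[float], pred_augmented: Sequence[float]) -> float:
    """Sum over classes of the population variance of the two predicted probabilities."""
    return sum(pvariance([u, a]) for u, a in zip(pred_original, pred_augmented))


def rank_by_inconsistency(
    samples: Sequence[float],
    model: Callable[[float], List[float]],
    perturb: Callable[[float], float],
) -> List[Tuple[float, float]]:
    """Return (inconsistency value, sample) pairs sorted in descending order."""
    scored = []
    for sample in samples:
        pred_u = model(sample)           # prediction for the unlabeled sample
        pred_a = model(perturb(sample))  # prediction for the augmented sample
        scored.append((inconsistency_value(pred_u, pred_a), sample))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored


def toy_model(x: float) -> List[float]:
    """Hypothetical two-class model: a steep sigmoid centered at x = 0.5."""
    p = 1.0 / (1.0 + math.exp(-10.0 * (x - 0.5)))
    return [p, 1.0 - p]


def toy_perturb(x: float) -> float:
    """Hypothetical perturbation: shift the input by a small constant."""
    return x + 0.05
```

With the toy model, a sample near the decision boundary (x = 0.5) changes its prediction the most under perturbation and therefore ranks first.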

At step 412, the method 400 includes obtaining, by the data processing hardware 12, for each unlabeled training sample 112U in a threshold number of unlabeled training samples 112U_(T) selected from the sorted unlabeled training samples 112U in the set of unlabeled training samples 112U, a ground truth label 132G. The method 400, at step 414, includes selecting, by the data processing hardware 12, a current set of labeled training samples 112L, the current set of labeled training samples 112L including each unlabeled training sample 112U in the threshold number of unlabeled training samples 112U_(T) selected from the sorted unlabeled training samples 112U in the set of unlabeled training samples 112U paired with the corresponding obtained ground truth label 132G. At step 416, the method 400 includes training, by the data processing hardware 12, using the current set of labeled training samples 112L and a proper subset of unlabeled training samples 112U_(P) from the set of unlabeled training samples 112U, the machine learning model 130.
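One active learning cycle of the method 400 may be sketched in Python as follows, purely as an illustration. The scoring function, the oracle, and the training routine are passed in as callables and are hypothetical, as is the function name run_cycle. The sketch further assumes that the proper subset of unlabeled training samples used at step 416 is the set of samples that were not selected for labeling in this cycle; the disclosure does not limit the proper subset in this way.

```python
from typing import Callable, List, Sequence, Tuple, TypeVar

S = TypeVar("S")  # type of an unlabeled training sample
L = TypeVar("L")  # type of a ground truth label


def run_cycle(
    pool: Sequence[S],
    labeled: Sequence[Tuple[S, L]],
    score: Callable[[S], float],
    oracle: Callable[[S], L],
    train: Callable[[List[Tuple[S, L]], List[S]], None],
    m: int,
) -> Tuple[List[Tuple[S, L]], List[S]]:
    """Run one cycle: rank, label the top m samples, then train."""
    # Step 410: sort sample indices in descending order of inconsistency value.
    order = sorted(range(len(pool)), key=lambda i: score(pool[i]), reverse=True)
    top = order[:m]
    # Step 412: obtain a ground truth label for each selected sample.
    new_pairs = [(pool[i], oracle(pool[i])) for i in top]
    # Step 414: form the current set of labeled training samples.
    current_labeled = list(labeled) + new_pairs
    chosen = set(top)
    remaining = [s for i, s in enumerate(pool) if i not in chosen]
    # Step 416: train on the labeled set together with unlabeled samples.
    train(current_labeled, remaining)
    return current_labeled, remaining
```

Calling run_cycle repeatedly, feeding each returned pair back in as the next labeled set and pool, would correspond to the plurality of active learning cycles.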

Thus, the model trainer 110 may identify unlabeled training samples 112U that have a high potential for performance improvement relative to the performance improvement of other unlabeled training samples 112U without increasing (and potentially reducing) the total labeling cost (e.g., expenditure of computation resources, consumption of human annotator time, etc.). The model trainer 110 also determines an appropriate size for an initial or starting set of labeled training examples 112L by using a cost-efficient approach that avoids overhead stemming from starting with large sets of labeled data samples 112L while also ensuring optimal model performance with a limited number of labeled training samples 112L (i.e., compared to conventional techniques).

FIG. 5 is a schematic view of an example computing device 500 that may be used to implement the systems and methods described in this document. The computing device 500 is intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The components shown here, their connections and relationships, and their functions, are meant to be exemplary only, and are not meant to limit implementations of the inventions described and/or claimed in this document.

The computing device 500 includes a processor 510, memory 520, a storage device 530, a high-speed interface/controller 540 connecting to the memory 520 and high-speed expansion ports 550, and a low-speed interface/controller 560 connecting to a low-speed bus 570 and the storage device 530. Each of the components 510, 520, 530, 540, 550, and 560 is interconnected using various busses, and may be mounted on a common motherboard or in other manners as appropriate. The processor 510 can process instructions for execution within the computing device 500, including instructions stored in the memory 520 or on the storage device 530 to display graphical information for a graphical user interface (GUI) on an external input/output device, such as display 580 coupled to the high-speed interface 540. In other implementations, multiple processors and/or multiple buses may be used, as appropriate, along with multiple memories and types of memory. Also, multiple computing devices 500 may be connected, with each device providing portions of the necessary operations (e.g., as a server bank, a group of blade servers, or a multi-processor system).

The memory 520 stores information non-transitorily within the computing device 500. The memory 520 may be a computer-readable medium, a volatile memory unit(s), or non-volatile memory unit(s). The non-transitory memory 520 may be physical devices used to store programs (e.g., sequences of instructions) or data (e.g., program state information) on a temporary or permanent basis for use by the computing device 500. Examples of non-volatile memory include, but are not limited to, flash memory and read-only memory (ROM)/programmable read-only memory (PROM)/erasable programmable read-only memory (EPROM)/electronically erasable programmable read-only memory (EEPROM) (e.g., typically used for firmware, such as boot programs). Examples of volatile memory include, but are not limited to, random access memory (RAM), dynamic random access memory (DRAM), static random access memory (SRAM), phase change memory (PCM) as well as disks or tapes.

The storage device 530 is capable of providing mass storage for the computing device 500. In some implementations, the storage device 530 is a computer-readable medium. In various different implementations, the storage device 530 may be a floppy disk device, a hard disk device, an optical disk device, or a tape device, a flash memory or other similar solid state memory device, or an array of devices, including devices in a storage area network or other configurations. In additional implementations, a computer program product is tangibly embodied in an information carrier. The computer program product contains instructions that, when executed, perform one or more methods, such as those described above. The information carrier is a computer- or machine-readable medium, such as the memory 520, the storage device 530, or memory on processor 510.

The high-speed controller 540 manages bandwidth-intensive operations for the computing device 500, while the low-speed controller 560 manages lower bandwidth-intensive operations. Such allocation of duties is exemplary only. In some implementations, the high-speed controller 540 is coupled to the memory 520, the display 580 (e.g., through a graphics processor or accelerator), and to the high-speed expansion ports 550, which may accept various expansion cards (not shown). In some implementations, the low-speed controller 560 is coupled to the storage device 530 and a low-speed expansion port 590. The low-speed expansion port 590, which may include various communication ports (e.g., USB, Bluetooth, Ethernet, wireless Ethernet), may be coupled to one or more input/output devices, such as a keyboard, a pointing device, a scanner, or a networking device such as a switch or router, e.g., through a network adapter.

The computing device 500 may be implemented in a number of different forms, as shown in the figure. For example, it may be implemented as a standard server 500a or multiple times in a group of such servers 500a, as a laptop computer 500b, or as part of a rack server system 500c.

Various implementations of the systems and techniques described herein can be realized in digital electronic and/or optical circuitry, integrated circuitry, specially designed ASICs (application specific integrated circuits), computer hardware, firmware, software, and/or combinations thereof. These various implementations can include implementation in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be special or general purpose, coupled to receive data and instructions from, and to transmit data and instructions to, a storage system, at least one input device, and at least one output device.

A software application (i.e., a software resource) may refer to computer software that causes a computing device to perform a task. In some examples, a software application may be referred to as an “application,” an “app,” or a “program.” Example applications include, but are not limited to, system diagnostic applications, system management applications, system maintenance applications, word processing applications, spreadsheet applications, messaging applications, media streaming applications, social networking applications, and gaming applications.

These computer programs (also known as programs, software, software applications or code) include machine instructions for a programmable processor, and can be implemented in a high-level procedural and/or object-oriented programming language, and/or in assembly/machine language. As used herein, the terms “machine-readable medium” and “computer-readable medium” refer to any computer program product, non-transitory computer readable medium, apparatus and/or device (e.g., magnetic discs, optical disks, memory, Programmable Logic Devices (PLDs)) used to provide machine instructions and/or data to a programmable processor, including a machine-readable medium that receives machine instructions as a machine-readable signal. The term “machine-readable signal” refers to any signal used to provide machine instructions and/or data to a programmable processor.

The processes and logic flows described in this specification can be performed by one or more programmable processors, also referred to as data processing hardware, executing one or more computer programs to perform functions by operating on input data and generating output. The processes and logic flows can also be performed by special purpose logic circuitry, e.g., an FPGA (field programmable gate array) or an ASIC (application specific integrated circuit). Processors suitable for the execution of a computer program include, by way of example, both general and special purpose microprocessors, and any one or more processors of any kind of digital computer. Generally, a processor will receive instructions and data from a read only memory or a random access memory or both. The essential elements of a computer are a processor for performing instructions and one or more memory devices for storing instructions and data. Generally, a computer will also include, or be operatively coupled to receive data from or transfer data to, or both, one or more mass storage devices for storing data, e.g., magnetic, magneto optical disks, or optical disks. However, a computer need not have such devices. Computer readable media suitable for storing computer program instructions and data include all forms of non-volatile memory, media and memory devices, including by way of example semiconductor memory devices, e.g., EPROM, EEPROM, and flash memory devices; magnetic disks, e.g., internal hard disks or removable disks; magneto optical disks; and CD ROM and DVD-ROM disks. The processor and the memory can be supplemented by, or incorporated in, special purpose logic circuitry.

To provide for interaction with a user, one or more aspects of the disclosure can be implemented on a computer having a display device, e.g., a CRT (cathode ray tube), LCD (liquid crystal display) monitor, or touch screen for displaying information to the user and optionally a keyboard and a pointing device, e.g., a mouse or a trackball, by which the user can provide input to the computer. Other kinds of devices can be used to provide interaction with a user as well; for example, feedback provided to the user can be any form of sensory feedback, e.g., visual feedback, auditory feedback, or tactile feedback; and input from the user can be received in any form, including acoustic, speech, or tactile input. In addition, a computer can interact with a user by sending documents to and receiving documents from a device that is used by the user; for example, by sending web pages to a web browser on a user's client device in response to requests received from the web browser.

A number of implementations have been described. Nevertheless, it will be understood that various modifications may be made without departing from the spirit and scope of the disclosure. Accordingly, other implementations are within the scope of the following claims.

What is claimed is:
1. A computer-implemented method executed by data processing hardware that causes the data processing hardware to perform operations comprising: obtaining a set of unlabeled training samples; for each particular unlabeled training sample in the set of unlabeled training samples: generating, using a machine learning model and the particular unlabeled training sample, a corresponding first prediction; generating, using the machine learning model and a modified unlabeled training sample, a corresponding second prediction, the modified unlabeled training sample based on the particular unlabeled training sample; and determining a corresponding difference between the corresponding first prediction and the corresponding second prediction; selecting, based on the corresponding differences, a subset of the set of unlabeled training samples; for each particular unlabeled training sample in the subset of the set of unlabeled training samples: obtaining a corresponding ground truth label for the particular unlabeled training sample; and generating a corresponding labeled training sample based on the particular unlabeled training sample paired with the corresponding ground truth label; and training the machine learning model using the corresponding labeled training samples.
2. The method of claim 1, wherein a number of unlabeled training samples in the subset of the set of unlabeled training samples is less than a cardinality of the set of unlabeled training samples.
3. The method of claim 1, wherein selecting, based on the corresponding differences, the subset of the set of unlabeled training samples comprises selecting unlabeled training samples having a corresponding difference satisfying a threshold.
4. The method of claim 1, wherein selecting, based on the corresponding differences, the subset of the set of unlabeled training samples comprises selecting a threshold number of the unlabeled training samples having the largest corresponding differences.
5. The method of claim 1, wherein the operations further comprise, during an initial active learning cycle: randomly selecting a random set of unlabeled training samples from the set of unlabeled training samples; for each particular unlabeled training sample in the random set of unlabeled training samples, obtaining a corresponding ground truth label; and training the machine learning model using the random set of unlabeled training samples and the corresponding ground truth labels.
6. The method of claim 5, wherein the operations further comprise, during the initial active learning cycle: identifying a candidate set of unlabeled training samples from the set of unlabeled training samples, wherein a cardinality of the candidate set of unlabeled training samples is less than a cardinality of the set of unlabeled training samples; determining a first cross entropy between a distribution of ground truth labels and a distribution of predicted labels generated using the machine learning model for the unlabeled training samples in the candidate set of unlabeled training samples; determining a second cross entropy between a distribution of ground truth labels and a distribution of predicted labels generated using the machine learning model for the unlabeled training samples in the set of unlabeled training samples; determining that the first cross entropy is greater than or equal to the second cross entropy; and based on determining that the first cross entropy is greater than or equal to the second cross entropy, selecting the candidate set of unlabeled training samples as a starting size for initially training the machine learning model.
7. The method of claim 6, wherein identifying the candidate set of unlabeled training samples from the set of unlabeled training samples comprises determining the corresponding difference for each unlabeled training sample of the set of unlabeled training samples.
8. The method of claim 7, wherein the operations further comprise, when the first cross entropy is less than the second cross entropy: randomly selecting an expanded set of unlabeled training samples from the set of unlabeled training samples; updating the candidate set of unlabeled training samples to include the expanded set of unlabeled training samples randomly selected from the set of unlabeled training samples; updating the set of unlabeled training samples by removing each unlabeled training sample from the expanded set of unlabeled training samples from the set of unlabeled training samples; and during an immediately subsequent active learning cycle: determining the first cross entropy between a distribution of ground truth labels and a distribution of predicted labels generated using the machine learning model for the unlabeled training samples in the updated candidate set of unlabeled training samples; determining the second cross entropy between the distribution of ground truth labels and a distribution of predicted labels generated using the machine learning model for the unlabeled training samples in the updated candidate set of unlabeled training samples; determining that the first cross entropy is greater than or equal to the second cross entropy; and based on determining that the first cross entropy is greater than or equal to the second cross entropy, selecting a size of the updated candidate set of unlabeled training samples as a starting size for initially training the machine learning model.
9. The method of claim 1, wherein the machine learning model comprises a convolutional neural network.
10. The method of claim 1, wherein the corresponding difference between the corresponding first prediction and the corresponding second prediction represents a variance between the corresponding first prediction and the corresponding second prediction.
11. A system comprising: data processing hardware; and memory hardware in communication with the data processing hardware, the memory hardware storing instructions that when executed on the data processing hardware cause the data processing hardware to perform operations comprising: obtaining a set of unlabeled training samples; for each particular unlabeled training sample in the set of unlabeled training samples: generating, using a machine learning model and the particular unlabeled training sample, a corresponding first prediction; generating, using the machine learning model and a modified unlabeled training sample, a corresponding second prediction, the modified unlabeled training sample based on the particular unlabeled training sample; and determining a corresponding difference between the corresponding first prediction and the corresponding second prediction; selecting, based on the corresponding differences, a subset of the set of unlabeled training samples; for each particular unlabeled training sample in the subset of the set of unlabeled training samples: obtaining a corresponding ground truth label for the particular unlabeled training sample; and generating a corresponding labeled training sample based on the particular unlabeled training sample paired with the corresponding ground truth label; and training the machine learning model using the corresponding labeled training samples.
12. The system of claim 11, wherein a number of unlabeled training samples in the subset of the set of unlabeled training samples is less than a cardinality of the set of unlabeled training samples.
13. The system of claim 11, wherein selecting, based on the corresponding differences, the subset of the set of unlabeled training samples comprises selecting unlabeled training samples having a corresponding difference satisfying a threshold.
14. The system of claim 11, wherein selecting, based on the corresponding differences, the subset of the set of unlabeled training samples comprises selecting a threshold number of the unlabeled training samples having the largest corresponding differences.
15. The system of claim 11, wherein the operations further comprise, during an initial active learning cycle: randomly selecting a random set of unlabeled training samples from the set of unlabeled training samples; for each particular unlabeled training sample in the random set of unlabeled training samples, obtaining a corresponding ground truth label; and training the machine learning model using the random set of unlabeled training samples and the corresponding ground truth labels.
16. The system of claim 15, wherein the operations further comprise, during the initial active learning cycle: identifying a candidate set of unlabeled training samples from the set of unlabeled training samples, wherein a cardinality of the candidate set of unlabeled training samples is less than a cardinality of the set of unlabeled training samples; determining a first cross entropy between a distribution of ground truth labels and a distribution of predicted labels generated using the machine learning model for the unlabeled training samples in the candidate set of unlabeled training samples; determining a second cross entropy between a distribution of ground truth labels and a distribution of predicted labels generated using the machine learning model for the unlabeled training samples in the set of unlabeled training samples; determining that the first cross entropy is greater than or equal to the second cross entropy; and based on determining that the first cross entropy is greater than or equal to the second cross entropy, selecting the candidate set of unlabeled training samples as a starting size for initially training the machine learning model.
17. The system of claim 16, wherein identifying the candidate set of unlabeled training samples from the set of unlabeled training samples comprises determining the corresponding difference for each unlabeled training sample of the set of unlabeled training samples.
18. The system of claim 17, further comprising, when the first cross entropy is less than the second cross entropy: randomly selecting an expanded set of unlabeled training samples from the set of unlabeled training samples; updating the candidate set of unlabeled training samples to include the expanded set of unlabeled training samples randomly selected from the set of unlabeled training samples; updating the set of unlabeled training samples by removing each unlabeled training sample from the expanded set of unlabeled training samples from the set of unlabeled training samples; and during an immediately subsequent active learning cycle: determining the first cross entropy between a distribution of ground truth labels and a distribution of predicted labels generated using the machine learning model for the unlabeled training samples in the updated candidate set of unlabeled training samples; determining the second cross entropy between the distribution of ground truth labels and a distribution of predicted labels generated using the machine learning model for the unlabeled training samples in the updated candidate set of unlabeled training samples; determining that the first cross entropy is greater than or equal to the second cross entropy; and based on determining that the first cross entropy is greater than or equal to the second cross entropy, selecting a size of the updated candidate set of unlabeled training samples as a starting size for initially training the machine learning model.
19. The system of claim 11, wherein the machine learning model comprises a convolutional neural network.
20. The system of claim 11, wherein the corresponding difference between the corresponding first prediction and the corresponding second prediction represents a variance between the corresponding first prediction and the corresponding second prediction.